regex pattern for extracting URLs
on 23.10.2009 19:23:44 by Brad Fuller
I'm looking for a regular expression to accomplish a specific task.
I'm hoping someone who's really good at regex patterns can lend a quick hand.
I need a regex pattern that will grab URLs out of HTML that have a
certain link text. (i.e. the word "Continue")
This is what I have so far but it does not work properly (If there are
other attributes in the tag it returns them as part of the URL.)
preg_match_all('#]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)>Continue#i',
$html, $matches);
It needs to be able to extract the URL and disregard arbitrary
attributes in the HTML tag
Test it with the following examples:
onlick="someFunction('foo','bar')">Continue
Please reply
Your help is much appreciated.
Thanks in advance,
Brad F.
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: regex pattern for extracting URLs
on 23.10.2009 19:27:45 by List Manager
Looking at this document from an XML standpoint, I could see doing this rather
easily without having to use regex. You might look into using DOMDocument or
SimpleXML to complete the task.
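As a rough illustration of the DOM route suggested above, the following sketch loads HTML with DOMDocument and selects the href of every link whose text is "Continue" via DOMXPath. The sample markup and the variable names are assumptions made for the example; the actual page from the original post is not shown in the thread.

```php
<?php
// Hypothetical sample markup; the real page content is not shown in the thread.
$html = <<<HTML
<html><body>
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world HTML is often not well-formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the href attribute of every <a> whose trimmed text is "Continue".
$xpath = new DOMXPath($doc);
$urls = [];
foreach ($xpath->query("//a[normalize-space(.) = 'Continue']/@href") as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

Because the parser handles attribute order and quoting, extra attributes such as onclick do not affect the extracted URL.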
Re: regex pattern for extracting URLs
on 23.10.2009 19:28:52 by Ashley Sheridan
preg_match_all('#]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue#i', $html, $matches);
I just changed your regex a bit. What your regex was previously doing
was matching everything from the first quote after the href= right up
until the first > it found, which would usually be the one that closes
the opening tag. If you wish, you could make it a bit more intelligent
with a backreference, so that the closing quote must be the same type of
quotation character that opened the href's value.
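The backreference idea mentioned above could look roughly like the sketch below. The pattern is not the one from the thread: the leading `<a` and trailing `</a>` are assumed, since those parts of the posted patterns did not survive, and the sample markup is hypothetical. Group 1 captures the opening quote, and `(?!\1)` stops the URL at the first matching closing quote.

```php
<?php
// Hypothetical sample markup mixing double- and single-quoted href values.
$html = <<<HTML
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a class="link" href='http://example.com/path/to/urlC.html'>Continue</a>
<a href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
HTML;

// (["\'])          group 1: the opening quote character
// ((?:(?!\1).)*)   group 2: the URL, any characters up to the same quote
// \1               the closing quote must equal the opening one
// [^>]*>           skip any remaining attributes up to the end of the tag
$pattern = '#<a\b[^>]*?\shref\s*=\s*(["\'])((?:(?!\1).)*)\1[^>]*>\s*Continue\s*</a>#is';

preg_match_all($pattern, $html, $matches);

// With the default PREG_PATTERN_ORDER, $matches[2] holds the captured URLs.
$urls = $matches[2];
print_r($urls);
```

This still assumes no attribute value contains a literal `>`; a DOM parser is the more robust choice when that cannot be guaranteed.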
Thanks,
Ash
http://www.ashleysheridan.co.uk
Re: regex pattern for extracting URLs
on 23.10.2009 19:45:39 by Brad Fuller
I appreciate the help. However, when I try this I only get the first
character of the URL. Can you double check it, please?
Thanks again
Re: regex pattern for extracting URLs
on 23.10.2009 19:48:18 by Ashley Sheridan
I think it's probably the first ? in ([^\"\']+?), which makes that group
lazy, so it stops after a single character.
Remove that and it should do the trick.
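To see why the lazy quantifier yields only one character, the sketch below runs both variants against a single hypothetical link. As an assumption, the patterns restore a leading `<a[^>]*` and a trailing `</a>`, which are missing from the patterns as they appear in the thread. With `+?` the group takes the minimum of one character and the following `.+?` absorbs the rest of the URL; with `+` the group consumes everything up to the closing quote.

```php
<?php
// Hypothetical single link used to compare the two quantifiers.
$html = '<a href="http://example.com/path/to/urlA.html">Continue</a>';

// Lazy group: matches as few characters as possible (one),
// leaving the rest of the URL to the following ".+?".
preg_match('#<a[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?>Continue</a>#i', $html, $lazy);

// Greedy group: matches every character up to the closing quote.
preg_match('#<a[^>]*href\s*=\s*[\"\']+([^\"\']+).+?>Continue</a>#i', $html, $greedy);

echo $lazy[1], "\n";   // h
echo $greedy[1], "\n"; // http://example.com/path/to/urlA.html
```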
Thanks,
Ash
http://www.ashleysheridan.co.uk
Re: regex pattern for extracting URLs
on 23.10.2009 19:54:34 by israelekpo
Hi Brad,
I agree with Jim.
Take a look at this. It might help.
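<?php
$xml_string = <<<TEXT_BOUNDARY
<body>
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a href="http://example.com/path/to/urlB.html">Continue</a>
<a href="http://example.com/path/to/url4.html">PHP.net</a>
<a href="http://example.com/path/to/urlC.html" class="link">Continue</a>
<a href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
</body>
TEXT_BOUNDARY;
$xml = simplexml_load_string($xml_string);
$continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");
print_r($continue_hrefs);
?>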
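The XPath query above returns SimpleXMLElement objects rather than plain strings, so print_r shows nested objects. A minimal sketch of turning them into an array of URL strings follows; the sample XML and the `$urls` variable are assumptions for illustration, and the input must be well-formed XML for simplexml_load_string to accept it.

```php
<?php
// Hypothetical, well-formed sample markup (SimpleXML needs valid XML).
$xml_string = <<<XML
<body>
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a href="http://example.com/path/to/urlC.html" class="link">Continue</a>
</body>
XML;

$xml = simplexml_load_string($xml_string);

$urls = [];
foreach ($xml->xpath("//a[text() = 'Continue']/@href") as $href) {
    // Each result is a SimpleXMLElement; the cast yields the attribute value.
    $urls[] = (string) $href;
}

print_r($urls);
```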
--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
Re: regex pattern for extracting URLs
on 23.10.2009 19:54:40 by Brad Fuller
That did the trick. Thanks Ash you are awesome!
Also thanks Jim for your suggestion. I may move to SimpleXML if the project
grows much bigger. But for now I was looking for a nice one liner and this
is it.
Cheers,
Brad
Re: regex pattern for extracting URLs
on 23.10.2009 20:08:53 by Brad Fuller
Thanks, I'm sure I will use this at some point in the future :)
Re: regex pattern for extracting URLs
on 23.10.2009 21:17:11 by Paul M Foster
On Fri, Oct 23, 2009 at 01:54:40PM -0400, Brad Fuller wrote:
> Thanks Ash you are awesome!
Brad, you're violating list rules. We never say that kind of thing to
Ash *where he can hear it*. Only behind his back. ;-}
Paul
--
Paul M. Foster
Re: regex pattern for extracting URLs
on 23.10.2009 21:18:00 by Ashley Sheridan
Well, it makes a refreshing change; off list, people just want to insult
me :p
Thanks,
Ash
http://www.ashleysheridan.co.uk